Integration of Speech & Video: Applications for Lip Synch: Lip Movement Synthesis & Time Warping

Author

  • Jon P. Nedel
Abstract

Throughout the past several decades, much research has been done in the area of signal processing. Two of the most popular areas within this field have been applications for speech recognition and image processing. Due to these extended efforts, today there are systems that can accurately recognize and transcribe the daily television news programs that are broadcast to our homes. There are also systems that can accurately locate and track the faces within those same news programs. Recently, a new field has emerged which focuses on combining the disciplines of speech recognition and image processing in a practical way. This interest has sparked research in a broad range of new application areas, including:

  • Enhancement of speech recognition via visual cues
  • Animation of interactive talking agents to facilitate human-computer interaction
  • Synchronization of audio and video tracks in the film industry
  • Correction of lip movement for films dubbed into a foreign language and in video conferencing

This paper will discuss some of the current efforts in integrating speech and video in a practical way. It will briefly discuss some image processing methods for extraction of lip coordinates and lip features from video. The focus of this investigation, however, is the use of speech recognition and other signal processing techniques in the development of two practical systems: one for lip movement synthesis and the other for lip synchronization. First will be the use of a speech recognition technique (Viterbi forced alignment) to segment the lip features extracted from video on a phoneme-by-phoneme basis. Once the video features are segmented, appropriate models can then be created to link the audio and video features together. Effective models can be built on a very limited amount of training data.
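The segmentation step described above can be sketched in a few lines: once a Viterbi forced-alignment pass has produced phone-level timestamps, the per-frame lip feature track is sliced per phoneme for model building. The function name, the alignment tuple format, and the 30 fps frame rate here are illustrative assumptions, not the paper's implementation:

```python
def segment_features(alignment, features, fps=30.0):
    """Slice per-frame lip features into phoneme-aligned segments.

    alignment: list of (phone, start_sec, end_sec) triples, e.g. from a
               forced aligner's output (format assumed for illustration).
    features:  sequence of per-video-frame lip feature vectors at `fps`.
    Returns a dict mapping each phone to its list of feature segments.
    """
    segments = {}
    for phone, start, end in alignment:
        i0 = int(round(start * fps))  # first video frame of the phone
        i1 = int(round(end * fps))    # frame just past the phone's end
        segments.setdefault(phone, []).append(features[i0:i1])
    return segments
```

Grouping segments by phone in this way is what makes it possible to build per-phoneme audio-to-video models from very little training data, since every occurrence of a phone contributes to the same model.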
Next will be the development and description of a system for the creation of synthetic lip features based on information contained in the speech recognizer output and the models discussed earlier. These features can then be used to automatically generate accurate and believable synthetic lip movements that correspond to any audio speech waveform. Also, a separate system to automatically synchronize lip motion and voice is under development. This system is based on dynamic time warping techniques on the output of the speech recognizer and the models that relate the audio and video features. Finally, there will be some discussion about the methods for performance evaluation of such systems, as well as some ideas for future research in this and other areas of multimodal signal processing.
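The synchronization system rests on dynamic time warping, which finds the lowest-cost monotonic alignment between two feature sequences. A minimal sketch of classic DTW over 1-D features follows; the distance measure and interface are assumptions for illustration, not the system described in the paper:

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping between two 1-D feature sequences.

    Returns the cumulative alignment cost and the warping path,
    a list of (index-in-a, index-in-b) pairs.
    """
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)   # cumulative cost matrix
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],       # step in a only
                                 D[i, j - 1],       # step in b only
                                 D[i - 1, j - 1])   # step in both
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        moves = [(D[i - 1, j - 1], i - 1, j - 1),
                 (D[i - 1, j], i - 1, j),
                 (D[i, j - 1], i, j - 1)]
        _, i, j = min(moves)
    return D[n, m], path[::-1]
```

Applied to lip synchronization, one sequence would hold features derived from the recognizer output on the audio track and the other the video lip features; the warping path then specifies how to stretch or compress one track so the two line up in time.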

Related articles

Real-Time Lip Tracking for Audio-Visual Speech Recognition Applications

Developments in dynamic contour tracking permit sparse representation of the outlines of moving contours. Given the increasing computing power of general-purpose workstations, it is now possible to track human faces and parts of faces in real time without special hardware. This paper describes a real-time lip tracker that uses a Kalman filter based dynamic contour to track the outline of the lips....

Real-time lip-synch face animation driven by human voice

In this demo, we present a technique for synthesizing the mouth movement from acoustic speech information. The algorithm maps the audio parameter set to the visual parameter set using the Gaussian Mixture Model and the Hidden Markov Model. With this technique, we can create smooth and realistic lip movements.

Development of Infrared Lip Movement Sensor for Spoken Word Recognition

A speaker's lip movement is highly informative for many applications of speech signal processing, such as multi-modal speech recognition and password authentication without a speech signal. However, to collect multi-modal speech information, we need a video camera, a large amount of memory, a video interface, and a high-speed processor to extract lip movement in real time. Such a system tends to be expen...

Sight and sound persistently out of synch: stable individual differences in audiovisual synchronisation revealed by implicit measures of lip-voice integration

Are sight and sound out of synch? Signs that they are have been dismissed for over two centuries as an artefact of attentional and response bias, to which traditional subjective methods are prone. To avoid such biases, we measured performance on objective tasks that depend implicitly on achieving good lip-synch. We measured the McGurk effect (in which incongruent lip-voice pairs evoke illusory ...

The physiologic development of speech motor control: lip and jaw coordination.

This investigation was designed to describe the development of lip and jaw coordination during speech and to evaluate the potential influence of speech motor development on phonologic development. Productions of syllables containing bilabial consonants were observed from speakers in four age groups (i.e., 1-year-olds, 2-year-olds, 6-year-olds, and young adults). A video-based movement tracking ...


Publication date: 1999